Initialization cells - run this first!

In [1]:
using Plots, Images, Interact, ProgressMeter, LinearAlgebra, Statistics, Arpack
gr(
    label="",
    markerstrokewidth=0.5
)
imshow(img, args...; kwargs...) = heatmap(reverse(img; dims=1), showaxis=false, args...; kwargs...)

Out[1]:
imshow (generic function with 1 method)

Matrix factorizations

Let $A$ be an $m \times n$ matrix. The range or column space of $A$ denoted by $\mathcal{R}(A)$ is the subspace of vectors formed from all linear combinations of its columns $A[:,1], \ldots, A[:,n]$. Let $r$ denote the rank of $A$, i.e., the number of linearly independent columns.
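As a quick illustrative sketch in Julia (the matrix is a made-up example), the built-in rank function reports the number of linearly independent columns:

using LinearAlgebra

A = [1.0 2.0 3.0;    ## third column = first column + second column
     2.0 4.0 6.0;
     1.0 0.0 1.0]
rank(A)   ## 2 -- only two of the three columns are linearly independent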

We would like to find $r$ basis vectors $w_{1}, \ldots, w_{r}$ for $\mathcal{R}(A)$ and express each column of $A$ as a linear combination with respect to these basis vectors. Thus, for example, we might express the first column as

$$A[:,1] = w_{1} x_{11} + \ldots + w_{r} x_{r1},$$

and the second column as

$$A[:,2] = w_{1} x_{12} + \ldots + w_{r} x_{r2},$$

and so on.

In matrix-vector notation, this yields the matrix factorization

$$ A = W X,$$

where $X$ is an $r \times n$ matrix and

$$A[:,j] = w_{1} x_{1j} + \ldots + w_{r} x_{rj}.$$

There are many ways of expressing a matrix $A$ in such a form. As we shall shortly see, factorizations that are learned from the data yield interesting insights and enable new applications.
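Here is a minimal sketch of such a factorization, reusing the made-up rank-2 matrix from above:

using LinearAlgebra

A = [1.0 2.0 3.0;
     2.0 4.0 6.0;
     1.0 0.0 1.0]     ## rank r = 2

W = A[:, 1:2]         ## w₁, w₂: a basis for R(A)
X = [1.0 0.0 1.0;     ## A[:,1] = w₁,  A[:,2] = w₂,
     0.0 1.0 1.0]     ## A[:,3] = w₁ + w₂
norm(A - W * X)       ## ≈ 0, so A = W X with W of size 3 × 2 and X of size 2 × 3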

Let $e_i$ denote a sparse vector whose elements are all equal to zero except for the $i$-th element, which is equal to one.

Thus,

$$e_1 = \begin{bmatrix} 1 & \ldots & 0 \end{bmatrix}^T,$$

and so on.

In Julia:

function eᵢ(i, n)
    eᵢ = zeros(n)
    eᵢ[i] = 1
    return eᵢ
end

returns $e_i$ for the input arguments i and n.

The $n$ dimensional vectors $e_{1}, \ldots, e_{n}$ form an elementary basis for $\mathbb{R}^n$ -- they are often referred to as the standard or Euclidean basis vectors.
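A quick usage check (assuming the eᵢ function above has been run in a cell):

using LinearAlgebra

eᵢ(2, 4)                               ## [0.0, 1.0, 0.0, 0.0]
eᵢ(2, 4) == Matrix(1.0I, 4, 4)[:, 2]   ## true: the 2nd column of the 4 × 4 identity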

Learning the largest left and right singular vectors

The left and right singular vectors can be learned from $A$ via an optimization problem. This entails specifying the exact objective (also known as "cost" or "loss") function and the underlying constraints on the variable(s) being optimized. We begin by describing how we can learn the largest right singular vector.

Learning the largest right singular vector

Suppose we are given the matrix $A$. We consider a "best direction" problem that is formulated as follows: among all vectors $x$ constrained to have unit length or norm -- these are vectors on the unit hypersphere -- let $x_{\sf opt}$ denote the vector that maximizes the norm of the vector $Ax$.

Mathematically this is equivalent to the optimization problem

\begin{equation}\label{eq:manopt1} x_{\sf opt} = \textrm{arg max}_{||x||_2 = 1} \parallel A x \parallel_{2}. \end{equation}

This problem can be solved analytically via the singular value decomposition. Let the SVD of $A$ be given by

$$A = \sum_{i=1}^{r} \sigma_{i} u_i v_i^T,$$

where $\sigma_{1} \geq \ldots \geq \sigma_{r}$ are its singular values. Then

$$x_{\sf opt} = v_{1}.$$

In other words, $x_{\sf opt}$ is the right singular vector of $A$ associated with its largest singular value. If there are multiple singular values equal to the largest singular value, then any associated singular vector or their associated linear combination (which is also a singular vector) will also be a solution.

Recall that if $v_i$ is a right singular vector of $A$ associated with the singular value $\sigma_i$ and the left singular vector $u_i$ then

$$A v_i = \sigma_i u_i.$$

However, we also have that

$$A (-v_i) = \sigma_i (-u_i)$$

Thus $-u_i$ and $-v_i$ are also candidate left and right singular vectors associated with the singular value $\sigma_i$ and there is an inherent sign ambiguity associated with the singular vectors. The svd command in Julia always returns the same U and V matrices for the same A. The svds command does not -- more on that later.
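A quick numerical check of the sign ambiguity (a sketch with an arbitrary random matrix):

using LinearAlgebra

A = randn(4, 3)
U, s, V = svd(A)
u₁, v₁, σ₁ = U[:, 1], V[:, 1], s[1]

norm(A * v₁ - σ₁ * u₁)           ## ≈ 0: (u₁, v₁) is a singular pair
norm(A * (-v₁) - σ₁ * (-u₁))     ## ≈ 0: (-u₁, -v₁) works equally well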

Many numerical optimization packages minimize functions. So to use them, we will have to recast our problem of finding $v_1$ as a minimization problem instead of a maximization problem.

This is straightforward: the $x$ that maximizes $||Ax||_2$ is exactly the $x$ that minimizes $-||Ax||_2$. We now verify this numerically by solving the manifold optimization problem using the Optim.jl package and comparing the result to the theoretically predicted answer. To that end we will first define the function xthatMaximizes_norm_Ax_onSphere which takes as its input the matrix A and returns as its output xopt given by the numerical solution to Eq. (1).

In [2]:
using Optim, LinearAlgebra

function xthatMaximizes_norm_Ax_onSphere(
        A::AbstractMatrix,
        maxiters::Integer=1000,
        x0::Vector = randn(size(A,2))
    ) 
    m, n = size(A)
    x0 = normalize(x0) ## make it unit norm
    
    opt = Optim.optimize(
        x -> -norm(A*x,2), x0,      ## objective: minimizing -||Ax||₂ maximizes ||Ax||₂
        Optim.ConjugateGradient(
            manifold = Optim.Sphere() 
            ),
        Optim.Options(iterations = maxiters)
    )
    
    xopt = opt.minimizer
    
    return xopt
end
Out[2]:
xthatMaximizes_norm_Ax_onSphere (generic function with 3 methods)

Let us see if the numerical solution matches the theoretical prediction.

In [3]:
A = randn(3, 3)
xopt = xthatMaximizes_norm_Ax_onSphere(A, 10)
U, s, V = svd(A)
u₁, v₁ = U[:,1], V[:,1] ## TODO: Match up with theory
@show xopt
@show v₁
@show v₁' * xopt
xopt = [0.266366, -0.780749, -0.565226]
v₁ = [-0.266366, 0.780749, 0.565226]
v₁' * xopt = -1.0
Out[3]:
-1.0

The inner product abs(v₁'xopt) is not exactly equal to one because of the termination criterion built into the solver. As we increase maxiters, the inner product should get closer to one (so long as there is a unique $x$, aside from sign, that maximizes $||Ax||_2$). Of course, as the number of iterations is increased, the algorithm takes longer to run.

Let us examine the effect of the number of iterations on the closeness between xopt and v₁ using the code in the next cell.

In [4]:
maxiters = [10, 20, 40, 80, 160, 240, 320, 500, 1000, 1250, 1500]
innerproduct = zeros(length(maxiters))
@showprogress for idx in 1:length(maxiters)
   xopt = xthatMaximizes_norm_Ax_onSphere(A, maxiters[idx]); 
   innerproduct[idx] = (xopt'*v₁);
end
Progress: 100%|█████████████████████████████████████████| Time: 0:00:00

Let us plot the inner product thus computed as a function of the number of iterations.

In [5]:
plot(
    maxiters,
    innerproduct;
    ylim=(-1.1, 1.1),
    xlabel="maxiters",
    ylabel="inner product",
    marker=:circle
)
Out[5]:
[Plot: inner product vs. maxiters]

The plot looks wobbly because the inner product jumps from $-1$ to $1$. This illustrates that the numerical solver sometimes returns an answer close to -v₁, and other times close to v₁. So let's plot the absolute value in the next cell.

In [6]:
plot(
    maxiters,
    abs.(innerproduct),
    ylim=(-1.1, 1.1), 
    label="abs. inner product", 
    xlabel="maxiters",
    marker=:square,
    color=:red,
)
Out[6]:
[Plot: absolute inner product vs. maxiters]

Being able to recognize matrix-vector products is an important skill in data science. We now describe a radically new application enabled by the mathematics we have just developed.

Application: Making "opaque" materials transparent

We typically think of objects such as milk, white paper, eggshells, fog, and snow as opaque because we cannot see through them. These objects are composed of millions and billions of randomly distributed micro-scatterers which frustrate the passage of light by acting as a maze to any incident light. Changing the angle of the incident light does not affect the amount that gets through because every direction is equally bad.

Associated with every such scattering medium is a scattering matrix, which encodes how light from every incident angle gets scattered into outgoing light of every angle. It turns out that if we excite a wavefront corresponding to the largest right singular vector of this scattering matrix, then we can make the material transparent!

The singular value corresponding to this perfectly transmitting wavefront is nearly 1, while the singular values corresponding to most other wavefronts are close to zero, reflecting the near opacity of the medium. Recent research has revealed that "most" such opaque media composed of non-absorbing scatterers have at least one perfectly transmitting eigen-wavefront regardless of how deep the medium is. These wavefronts can be efficiently found and controlled; the role of the SVD in changing how we think about opaqueness is described here.

Sculpting light to make opaque objects transparent

Click on the picture above to see a video illustrating this effect. For more details, see the write up here.

Learning the largest left singular vector

Since

$$||x^T A||_2 = ||A^T x||_2,$$

and given that if $A = U \Sigma V^T$ then $A^T = V\, \Sigma^T U^T$, we can conclude, via an application of the previously derived result that

\begin{equation} u_1(A) = \arg \max_{||x||_2 = 1} ||x^T A||_2 \end{equation}

We can numerically verify this via the next cell.

In [7]:
xopt = xthatMaximizes_norm_Ax_onSphere(A') 
@show xopt' * u₁
xopt_alt = xthatMaximizes_norm_Ax_onSphere(-A') 
@show xopt_alt' * u₁;
xopt' * u₁ = 0.9999999999999999
xopt_alt' * u₁ = -0.9999999999999998

Note that we exploited the fact that $||x^TA||_2 = ||x^T(-A)||_2$. Being able to spot matrix-vector products is an important skill in computational data science.

Learning the right singular vectors one at a time

We will now solve the manifold optimization problem

$$ x_{\sf opt} = \arg \max_{x} \parallel A\, x \parallel_{2} $$

subject to the spherical constraint

$$ ||x||_{2}^{2} = 1, $$

and the orthogonality constraint $ x \perp v_{1}, \ldots, v_{k-1} $ with $v_{0} = 0$.

It can be shown that

$$x_{\sf opt} = v_k.$$

This provides a recipe for peeling off the singular vectors one at a time. We need to optimize over the set of vectors with unit norm that are orthogonal to the subspace spanned by the columns of the matrix

$$ V = \begin{bmatrix} v_{1} & \ldots & v_{k-1} \end{bmatrix} $$

We will now reformulate this optimization problem into one involving maximization on the unit sphere.
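The key observation is that if $P = I - VV^T$ is the orthogonal projector onto the complement of $\mathcal{R}(V)$, then maximizing $||A P x||_2$ over the whole unit sphere automatically yields a maximizer orthogonal to $v_1, \ldots, v_{k-1}$: any component of $x$ inside $\mathcal{R}(V)$ is annihilated by $P$ and only wastes unit-norm "budget". A small sketch of the projector's properties (illustrative):

using LinearAlgebra

A = randn(5, 4)
V = svd(A).V[:, 1:2]   ## v₁ and v₂
P = I - V * V'         ## projector onto the complement of span(v₁, v₂)

x = randn(4)
norm(V' * (P * x))     ## ≈ 0: P x is orthogonal to v₁ and v₂
norm(P * P - P)        ## ≈ 0: P is idempotent, as a projector must be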

We implement the function manoptV to do precisely this in the next cell.

In [8]:
using Optim
using LinearAlgebra

function manoptV(A::AbstractMatrix, k::Integer=minimum(size(A)), maxiters::Integer=1000)

    m, n = size(A)
    xopt = xthatMaximizes_norm_Ax_onSphere(A, maxiters)   ## v₁
    V = xopt

    for i in 2 : min(k, n - 1)
        P_ortho_V = I - V * V'   ## projector onto the complement of span(V)
        xopt = xthatMaximizes_norm_Ax_onSphere(A * P_ortho_V, maxiters)   ## vᵢ
        V = hcat(V, xopt)
    end

    if k == n   ## the last vector is determined (up to sign) by orthogonality alone
        vn = (I - V * V') * randn(n)
        vn = vn / norm(vn)
        V = hcat(V, vn)
    end
    return V
end
Out[8]:
manoptV (generic function with 3 methods)

Exercise:

Describe the algorithm in words and relate it to the appropriate optimization problem involving projection matrices.

If we maximize $||Ax||_2$ over the unit sphere, the resulting $x$ is the largest right singular vector $v_1$. If we additionally constrain $x$ to lie in the orthogonal complement of $v_1$, then the maximizer is $v_2$: $\arg\max ||A P_{\mathcal{R}(v_1)}^\perp x||_2 = v_2$. Similarly, $\arg\max ||A P_{\mathcal{R}(v_1, \ldots, v_k)}^\perp x||_2 = v_{k+1}$.

We now compute and display the matrix of inner products Vmanopt' * Vsvd. If the two sets of singular vectors are identical then we will get a $k \times k$ identity matrix.

In [9]:
k = 3 ## extract k singular vectors
Vmanopt = manoptV(A, k)
Usvd, ssvd, Vsvd = svd(A)
@show Vmanopt' * Vsvd

heatmap(Vmanopt' * Vsvd; yflip=true, ticks=[1, 2, 3])
Vmanopt' * Vsvd = [1.0 4.24631e-9 -1.39249e-9; 4.27002e-9 -1.0 -9.12845e-12; 1.1737e-9 6.75377e-10 1.0]
Out[9]:
[Heatmap: Vmanopt' * Vsvd]

The off-diagonal elements are on the order of 1e-8 or lower, consistent with the default tolerance of the numerical solver used. The diagonal elements, however, can be $-1$ as well as $+1$ because of the sign ambiguity of the singular vectors.

Consequently, the matrix abs(Vmanopt' * Vsvd) will typically be close to the identity matrix even when Vmanopt' * Vsvd is not, as illustrated in the next cell.

In [10]:
heatmap(abs.(Vmanopt' * Vsvd); yflip=true, ticks=[1, 2, 3])
Out[10]:
[Heatmap: abs(Vmanopt' * Vsvd)]

The same procedure applied to A' learns the left singular vectors, as we verify numerically in the next cell.

In [11]:
Umanopt = manoptV(A',k)
heatmap(abs.(Usvd' * Umanopt); yflip=true, ticks=[1, 2, 3])
Out[11]:
[Heatmap: abs(Usvd' * Umanopt)]

Julia and other languages leverage LAPACK for computing eigenvectors and singular vectors -- that computation is not via the optimization procedure above but via more "direct" methods. The optimization viewpoint is nonetheless important, even if not computationally competitive for larger matrices, because it shows how we can learn the singular vectors by appropriately defining the loss function and the orthonormality constraints.

We now describe a matrix factorization into so-called principal components that utilizes the SVD.

Principal Component Analysis & the PCA matrix factorization

Let $A$ be an $m \times n$ matrix whose columns $A[:,1], \ldots, A[:,n]$ denote $n$ measurements of an $m$ dimensional multi-variate dataset. Let $x$ be an $m \times 1$ unit norm vector.

Step 1: Note that the elements of the row vector

$$ x^T A = \begin{bmatrix} x^T A[:,1], & x^T A[:,2], & \ldots & x^T A[:,n] \end{bmatrix},$$

can be interpreted as the coordinates of the columns of $A$ with respect to the basis vector $x$.

Step 2: In this viewpoint, the variance of the coordinates is given by

$$ \textrm{var}(x^T A) = \frac{1}{n}\sum_{i=1}^{n} \left(x^T A[:,i] -\frac{1}{n}\sum_{j} x^T A[:,j]\right)^2,$$

which can be rewritten as

\begin{equation} \textrm{var}(x^T A) = \frac{1}{n} \sum_{i=1}^{n} \left[x^T \left(A[:,i] -\frac{1}{n}\sum_{j} A[:,j]\right) \right]^2. \end{equation}

Step 3: Change of variables.

Let

$$ \overline{\mu}_A = \frac{1}{n} \sum_{i} A[:,i] = \frac{1}{n} A\,\mathbf{1},$$

be the mean (column) vector. Then we have that

$$ \textrm{var}(x^T A) = \frac{1}{n} \sum_{i=1}^{n} \left[x^T \left(A[:,i] -\overline{\mu}_A\right)\right]^2.$$

Let the centered data matrix $\overline{A}$ be defined as

$$\overline{A} = A - \overline{\mu}_A \mathbf{1}^T = A - \frac{1}{n} A \mathbf{1}\mathbf{1}^T.$$

Then the variance along the direction $x$ is given by

\begin{equation} \textrm{var}(x^T A) = \frac{1}{n} \parallel x^T \overline{A} \parallel_2^{2}. \end{equation}
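A quick numerical check of this identity (a sketch; corrected=false selects the $1/n$ convention in Statistics.var):

using LinearAlgebra
using Statistics: mean, var

m, n = 3, 500
A = randn(m, n) .+ [1.0, -2.0, 0.5]   ## made-up data with a nonzero mean
x = normalize(randn(m))

Ā = A .- mean(A, dims=2)              ## centered data matrix
var(vec(x' * A), corrected=false) ≈ norm(x' * Ā)^2 / n   ## true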

Step 4: Putting it all together, we see that the problem of finding the "direction that maximizes variance" can be cast as the optimization problem

\begin{equation} x_{\sf opt} = \arg \max\, \textrm{var}(x^T A) = \arg \max \frac{1}{n} \parallel x^T \overline{A} \parallel_2^{2}, \end{equation}

subject to $\parallel x \parallel_2 = 1$.

Next, we connect this to the manifold optimization problems considered earlier.

This direction is the so-called Principal Component, and applying what we derived earlier, we have that

$$x_{\sf opt} = u_{1}(\overline{A}).$$

The so-called Principal Components are thus simply the left singular vectors of the centered data matrix $\overline{A}$. The function manoptPCs in the next cell computes the $k$ leading principal components.

In [12]:
using Optim, LinearAlgebra
using Statistics:mean

function manoptPCs(A::AbstractMatrix, k::Integer, maxiters::Integer=1000)
    m, n = size(A)
    centeredA = A .- mean(A, dims = 2)        ## subtract the mean column
    return manoptV(centeredA', k, maxiters)   ## left singular vectors of centeredA
end
Out[12]:
manoptPCs (generic function with 2 methods)

The covariance matrix

The covariance between two jointly distributed real-valued random variables $x$ and $y$ with finite second moments is defined as

$$\textrm{cov}(x,y) = E[xy] − E[x]E[y].$$

The variables are considered to be uncorrelated if their covariance is zero. Note that $V[x] = E[x^2] - (E[x])^2$ is the variance of $x$ and will be greater than zero unless $x$ is a constant, non-random variable.

The cross covariance matrix of two vector valued random variables $x$ and $y$ is the matrix defined (in "dot" notation) as

\begin{equation} \textrm{cov}(x,y) = E.[xy^T] - E.[x] E.[y]^T. \end{equation}

The covariance matrix of a vector $x$ is the matrix of pair-wise covariances of the elements of $x$ and is defined as

\begin{equation} \Sigma_{x} = \textrm{cov}(x,x) = E.[xx^T] - E.[x] E.[x]^T. \end{equation}

The elements of $x$ are said to be uncorrelated if the covariance matrix $\Sigma_x$ is diagonal.

The sample covariance matrix of a matrix $X$ whose every row represents a random variable and whose columns represent $n$ samples (or measurements) is given by

\begin{equation} \widehat{\Sigma}_{X} = \dfrac{1}{n} XX^T - \mu_X \mu_X^T, \end{equation}

where

$$\mu_X = \dfrac{1}{n} X \mathbf{1},$$

is the mean column vector of $X$.
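These two expressions for the sample covariance agree exactly; a quick sketch with made-up data:

using LinearAlgebra

n = 1000
X = randn(3, n) .+ [1.0, 0.0, -1.0]   ## three variables, n samples, nonzero mean
μX = X * ones(n) / n                  ## mean column vector
Σhat = X * X' / n - μX * μX'          ## the formula above

Xc = X .- μX                          ## mean-centered data
norm(Σhat - Xc * Xc' / n)             ## ≈ 0: same matrix, computed via centering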

Consequently, when the $n$ columns of $A$ are drawn from a multi-variate normal distribution with an arbitrary mean and covariance $\Sigma_{aa}$, then in the limit of $n\to \infty$ we expect

$$ \frac{1}{n} \overline{A}\,\overline{A}^T \to \Sigma_{aa},$$

so that the principal components will approximately equal the eigenvectors of the covariance matrix $\Sigma_{aa}$.

We illustrate this next for samples of a multivariate normal with zero mean and diagonal covariance.

In [13]:
n = 2500 # number of samples
CovarianceA = Diagonal([1, 2])

# Columns are Normal(0, CovarianceA) distributed:
A = sqrt(CovarianceA)*randn(2, n)
@show A * A' / n; # approximates CovarianceA
(A * A') / n = [0.971304 0.00561116; 0.00561116 1.98474]

We will use the following plotting function to illustrate data (as a scatter plot) along with principal components (as arrows):

In [14]:
"""
    p = scatter_quiver(A, U)

Given a 2-by-n matrix A, plot the second row coordinates against
the first row coordinates. Then add quiver arrows for the
coordinates encoded in U.
"""
function scatter_quiver(A::Matrix, Upc::Matrix; title="PC1 (red) & PC2 (black)", alpha=0.2)
    p = scatter(
        A[1, :], A[2, :],
        alpha=alpha, 
        aspect_ratio=:equal,
    )
    quiver!(
        p, 
        [0, 0], [0, 0], 
        title=title, 
        quivers=[(Upc[1, 1], Upc[1, 2]), (Upc[2, 1], Upc[2, 2])]; color=[:black, :red , :black]
    )
    return p
end
Out[14]:
scatter_quiver

We now compute the principal components and plot them on the point cloud. Note that they should be approximately equal to $e_2$ and $e_1$, in that order.

In [15]:
Upc = manoptPCs(A, 2)
scatter_quiver(A, Upc; alpha=0.15)
Out[15]:
[Scatter plot with PC arrows: "PC1 (red) & PC2 (black)"]

We now replace the multivariate normal distribution with the uniform distribution and see that the Principal Components are still close to $e_1$ and $e_2$, provided the number of samples $n$ is large enough.

In [16]:
# centered A:
A = sqrt(CovarianceA) * (rand(2, n) - 0.5 * ones(2, n)) * sqrt(12)
Upc = manoptPCs(A, 2)
scatter_quiver(A, Upc)
Out[16]:
[Scatter plot with PC arrows: "PC1 (red) & PC2 (black)"]

PCA factorization of a matrix

We wish to express the matrix $Y$ as $Y = W_{\sf pca} X_{\sf pca}$, where the columns of $W_{\sf pca}$ point along the principal component directions. The rows of $X_{\sf pca}$ are the principal coordinates.

Step 1:

We first express $Y$ as the sum of a mean matrix $\overline{Y}$ and the matrix $\widetilde{Y}$ whose columns have mean zero such that

$$Y = \overline{Y} + \widetilde{Y},$$

where

$$ \overline{Y} = \mu_Y \mathbf{1}^T,$$

and

$$\mu_Y = \dfrac{1}{n} \sum_{i=1}^{n} Y[:,i].$$

Step 2:

We then express $\widetilde{Y}$ via its SVD as $$ \widetilde{Y} = U \Sigma V^H.$$

Step 3:

With the $U$ thus returned by PCA, we can decompose

$$ Y=W_{\sf pca} X_{\sf pca},$$

as below.

This yields the PCA factorization (or decomposition) of a data matrix $Y$.

Exercise:

Write the function pca_factorization which takes as its input the matrix Y and returns as its output Wpca and Xpca.

In [17]:
"""
    Wpca, Xpca = pca_factorization(Y)

* Input:  Data matrix Y
* Output: Factor matrices Wpca and Xpca such that Y = Wpca*Xpca
"""

using LinearAlgebra
using Statistics: mean

function pca_factorization(Y::AbstractMatrix)
    y_mean = mean(Y; dims=2) 
    Ymean = y_mean * ones(1, size(Y, 2))
    Ytil = Y - Ymean 
    
    U, s, V = svd(Ytil, full=false)
    
    Wpca = U*Diagonal(s) ##TODO: Fill in ??
    Winv = inv(Wpca) 
    Xpca = Winv * Ymean + V' ##TODO: Fill in ??
    
    return Wpca, Xpca
end
Out[17]:
pca_factorization (generic function with 1 method)

Let us verify that it is indeed a factorization using the code in the next cell. If the error is small then it is a factorization, otherwise it is not.

In [18]:
Y = randn(2, 1000)
Wpca, Xpca = pca_factorization(Y)
factorization_error = norm(Y - Wpca * Xpca)
println("Matrix factorization error is $(factorization_error)")
Matrix factorization error is 1.3726002025014376e-14

Application

We now illustrate how the PCA factorization of a data matrix can provide interesting insights on the data, via the principal coordinates.

To that end, we begin by loading a dataset as in the next cell.

In [19]:
using CSV, DataFrames
Xdata = CSV.read("breast-cancer-wisconsin.data.txt")
┌ Info: Precompiling CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b]
└ @ Base loading.jl:1186
Out[19]:

[DataFrame preview: 698 rows × 11 columns]

The matrix Xdata thus loaded is $698 \times 11$. The rows represent different patients ($=$ "samples"); the columns represent different attributes given below.

Column Attribute Type of data
1. Sample code number id number
2. Clump Thickness 1 - 10
3. Uniformity of Cell Size 1 - 10
4. Uniformity of Cell Shape 1 - 10
5. Marginal Adhesion 1 - 10
6. Single Epithelial Cell Size 1 - 10
7. Bare Nuclei 1 - 10
8. Bland Chromatin 1 - 10
9. Normal Nucleoli 1 - 10
10. Mitoses 1 - 10
11. Class (2 for benign, 4 for malignant)

You can read more about the data here.

The columns 2-10 of this dataset contain the data we will analyze. Column 11 contains label information (whether a cell was "benign" or "malignant") that we will not use in the analysis but will use in the visualization of the analyzed data. To that end, we create a $9 \times 698$ matrix A_breastcancer whose rows represent the various features and whose columns represent different samples (associated with different patients). Some of the samples correspond to benign cells while others correspond to malignant cells, and so we extract this label information for later use as below.

In [20]:
A_breastcancer = Matrix(Xdata[:,2:10])' 
benign = Xdata[:,end] .== 2
println("Number of bening samples = $(sum(benign))");
malignant = Xdata[:,end] .== 4; 
println("Number of malignant samples = $(sum(malignant))");
Number of benign samples = 457
Number of malignant samples = 241

We will now visualize the data in this matrix and ascertain from the visualization whether we can tell apart features corresponding to the benign cells or the malignant cells. To be able to display the data in two dimensions, we create a scatter plot of A_breastcancer[idx1,:] versus A_breastcancer[idx2,:] and vary idx1 and idx2.

Thus, for example, when idx1 = 1 and idx2 = 2, we are plotting the value of the "Clump Thickness" variable against the value of the "Uniformity of Cell Size" variable for each patient.

We will label the "benign" and "malignant" samples in the visualization as illustrated next.

In [21]:
using Interact
@manipulate for idx1 = (1, 2, 3, 4, 5, 6, 7, 8, 9), idx2 = (2, 3, 4, 5, 6, 7, 8, 9, 1)
    scatter(A_breastcancer[idx1,benign],A_breastcancer[idx2,benign], marker=:square, color=:red, label = "benign", 
        legend =:bottomright)
    scatter!(A_breastcancer[idx1,malignant],A_breastcancer[idx2,malignant], color=:blue, label = "malignant")
    plot!(; xlabel = "Elements of row $idx1", ylabel = "Elements of row $idx2")
end
Out[21]:

Thus projecting the 9-dimensional data onto the canonical Euclidean coordinates corresponding to idx1 and idx2 does not yield any insights.

We now consider the PCA factorization of the same matrix and generate a scatter plot using the first and second principal coordinates.

In [22]:
Wpca, Xpca = pca_factorization(A_breastcancer)
scatter(Xpca[1,benign],Xpca[2,benign], 
    color=:red, marker=:square, label = "benign", legend=:bottomright,
    xlabel = "PCA Coordinate 1", ylabel = "PCA Coordinate 2")
p_pca = scatter!(Xpca[1,malignant],Xpca[2,malignant], color=:blue, label = "malignant")
plot!(; title = "PCA")
Out[22]:
[Scatter plot "PCA": PCA Coordinate 1 vs. PCA Coordinate 2, benign (red) vs. malignant (blue)]

Thus the PCA factorization provides insights that the canonical factorization does not!

Independent Component Analysis & the ICA matrix factorization

PCA works well in many settings and is a workhorse algorithm for many machine learning and data science applications.

There are situations where a PCA matrix factorization does not yield insights but other factorizations do. We now consider one such factorization.

Motivation: Setting where PCA returns essentially random vectors

To motivate ICA, we perform PCA on a data matrix whose columns are drawn from a multivariate normal distribution with an identity covariance matrix.

In this setup, the covariance matrix can have as its eigenvectors the columns of any $2 \times 2$ orthogonal matrix. Consequently, the principal components will be randomly oriented, as almost every direction is equally good.

For a particular realization of $A$ there will be a "best direction" corresponding to the leading left singular vector of $\overline{A}$. For different realizations of $A$, however, the answer will change as we shall see next.

In [23]:
n = 1000
A = randn(2, 1000)
Upc = manoptPCs(A, 2)
scatter_quiver(A, Upc; title="PCs: Independent realization 1")
Out[23]:
[Scatter plot with PC arrows: "PCs: Independent realization 1"]
In [24]:
n = 1000
A = randn(2, 1000)
Upc = manoptPCs(A, 2)
scatter_quiver(A, Upc; title="PCs: Trial 2")
Out[24]:
[Scatter plot with PC arrows: "PCs: Trial 2"]

Finally, we conclude with an example of a non-Gaussian distribution with identity covariance where PCA yields random directions.

In [25]:
n = 1000
A = (rand(2, n) - 0.5 * ones(2, n)) * sqrt(12)
Upc = manoptPCs(A, 2)
scatter_quiver(A, Upc; title="PCs: Trial 1")
Out[25]:
[Scatter plot with PC arrows: "PCs: Trial 1"]
In [26]:
n = 1000
A = (rand(2, n) - 0.5 * ones(2,n)) * sqrt(12)
Upc = manoptPCs(A, 2)
scatter_quiver(A, Upc; title="PCs: Trial 2")
Out[26]:
[Scatter plot with PC arrows: "PCs: Trial 2"]

We might wonder if, in the above two examples, it is possible to recover the directions aligned with the axes. The geometry of the point cloud is non-isotropic, unlike in the Gaussian setting where it appears isotropic. It turns out we can, using a technique called Independent Component Analysis, which we describe next.

Independent component analysis

As before, let $\sum_{i} A[:,i] = 0$ and furthermore, assume that

$$ \dfrac{1}{n} AA^T = I.$$
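Such an $A$ can always be produced from raw data by whitening: if the centered data matrix has thin SVD $U \Sigma V^T$, then $A = \sqrt{n}\, V^T$ satisfies $\frac{1}{n} A A^T = V^T V = I$. This is precisely the whitening step that appears later inside ica_factorization. A small sketch (illustrative):

using LinearAlgebra
using Statistics: mean

n = 2000
A = [2.0 0.5; 0.0 1.0] * randn(2, n)   ## correlated, non-white data
Ā = A .- mean(A, dims=2)               ## center the columns

U, s, V = svd(Ā)
Aw = sqrt(n) * V'                      ## whitened data
norm(Aw * Aw' / n - I)                 ## ≈ 0: (1/n) Aw Awᵀ = I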

Consider the manifold optimization problem

$$ x_{\sf opt} = \textrm{arg max} |\kappa_4(x^T A)| ,$$

subject to the spherical constraint

$$ \| x \|_2^{2} = 1,$$

where, for $y \in \mathbb{R}^{n}$, $\kappa_4(y^T)$ is the fourth central cumulant or kurtosis and is defined as

$$\kappa_4(y^T) = \dfrac{1}{n} \sum_i y_i^4 - 3 \left[ \dfrac{1}{n} \sum_i y_i^2 \right]^2.$$

This problem cannot be solved in closed form, but can be solved numerically as coded up in the next cell.

Exercise:

Write a function called kurtosis which takes as its input the vector x and the matrix A and returns as its output the kurtosis of the vector $x^H \,A$.

In [27]:
using Statistics:mean
kurtosis(x::Vector, A::AbstractMatrix) = mean(z -> z^4, x' * A) - 3 * mean(z -> z^2, x' * A)^2
Out[27]:
kurtosis (generic function with 1 method)
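As a sanity check (illustrative): Gaussian samples have kurtosis near zero, while zero-mean, unit-variance uniform samples have kurtosis near $9/5 - 3 = -1.2$. (The former is why ICA cannot separate mixtures of Gaussians, as discussed later.)

n = 100_000
g = randn(1, n)                       ## Gaussian samples, as a 1 × n matrix
u = sqrt(12) * (rand(1, n) .- 0.5)    ## zero-mean, unit-variance uniform samples

kurtosis([1.0], g)   ## ≈ 0
kurtosis([1.0], u)   ## ≈ -1.2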

Thus while PCA finds directions that maximize the variance, ICA will find directions that maximize the absolute kurtosis. You can see this in the appropriate section of the code below.

Exercise:

Write a function called xthatMaximizes_kurt_xTA_onSphere which takes as its inputs the matrix A, the number of iterations maxiters, and an initial starting vector x0 and returns as its output the unit-norm $x$ that maximizes the absolute kurtosis of $x^H A$.

Recall that the optimizer in the Optim package minimizes the objective function.

In [28]:
using Optim
using Statistics:mean
using LinearAlgebra

kurtosis(x::Vector, A) = mean(z -> z^4, x' * A) - 3 * mean(z -> z^2, x' * A)^2

function xthatMaximizes_kurt_xTA_onSphere(
        A::AbstractMatrix,
        maxiters::Integer=1000,
        x0::Vector = randn(size(A,1))
    ) 

    m, n = size(A)
    x0 = normalize(x0)
    opt = Optim.optimize(
        x -> -abs(kurtosis(x, A)), x0,      ## minimizing -|κ₄| maximizes the absolute kurtosis
        Optim.ConjugateGradient(
            manifold=Optim.Sphere() 
            ),
        Optim.Options(iterations=maxiters)
    )
    xopt = opt.minimizer
    
    return xopt
end
Out[28]:
xthatMaximizes_kurt_xTA_onSphere (generic function with 3 methods)

The function below returns several independent components in a manner that is analogous to manoptPCs.

In [29]:
function manoptICs(A::AbstractMatrix, k::Integer=minimum(size(A)), maxiters::Integer=1000)
    
    m, n = size(A) 
    xopt = xthatMaximizes_kurt_xTA_onSphere(A, maxiters)
    U = xopt
    
    for i in 2:min(k, m - 1)
        P_ortho_U = I - U * U'   ## projector onto the complement of span(U)
        xopt = xthatMaximizes_kurt_xTA_onSphere(P_ortho_U * A, maxiters)
        U = hcat(U, xopt)
    end
    
    if k == m 
        um = (I - U * U') * randn(m)
        um = um / norm(um)
        U = hcat(U, um)
    end
    
    return U
end
Out[29]:
manoptICs (generic function with 3 methods)

Exercise:

-- What is the optimization problem that yields $U[:,i]$ above (not to be confused with the $U$ of the SVD) as a solution?

-- What is the role played by the P_ortho matrix relative to the optimization problem?

Let $\kappa_4(x)$ denote the kurtosis of a random variable $x$. Then

$$u_{\sf ica, i} = \arg \max_x \color{red}{|\kappa_4(x^T P^\perp_{\mathcal{R}(u_1, \ldots, u_{i-1})} A)|},$$

subject to

$$ \color{red}{||x||_2^2 = 1}.$$

We run it on the dataset for which PCA seemingly failed and then illustrate that the directions it produces are interestingly different from those PCA produced.

In [30]:
@show Uic = manoptICs(A, 2);
Uic = manoptICs(A, 2) = [0.0214351 -0.99977; -0.99977 -0.0214351]
In [31]:
println("Sum of the absolute entries of Uic equals $(sum(abs.(Uic)))")
Sum of the absolute entries of Uic equals 2.042410611848315
In [32]:
scatter_quiver(A, Uic; title="ICs")
Out[32]:
[Scatter plot with IC arrows: "ICs"]
In [33]:
n = 10000 # number of samples
Q = qr(randn(2, 10)).Q
println("Q = $Q")

A = Q * sqrt(12) * (rand(2, n) .- 0.5)

Upc = manoptPCs(A, 2)
println("Upca = $Upc")

Uic= manoptICs(A, 2)
println("Uica = $Uic")
Q = [-0.511043 -0.859555; -0.859555 0.511043]
Upca = [0.953734 -0.300652; 0.300652 0.953734]
Uica = [0.864582 -0.502492; -0.502492 -0.864582]

We now compare PCA versus ICA.

In [34]:
plot(
    scatter_quiver(A[:,1:10:end], Upc; title="PCA"),
    scatter_quiver(A[:,1:10:end], Uic; title="ICA")
)
Out[34]:
[Side-by-side scatter plots with arrows: "PCA" and "ICA"]

Exercise:

We claim that ICA recovers a direction that aligns with the edges of the square, even as PCA does not.

Do you agree? Run the experiment above multiple times and remark on what you notice about the ICA recovered bases relative to what PCA recovers.

Yes, ICA recovers a direction that aligns with the edges of the square most of the time, whereas PCA typically fails to do this.

Our claim is based on the fact that ICA is designed to unmix independent signals and so the ICA factorization yields as-independent-as-possible coordinates. In contrast, PCA yields uncorrelated coordinates. Independent coordinates are uncorrelated, but uncorrelated coordinates are not necessarily independent -- this is why ICA can yield additional insights in settings where PCA does not.

We will delve into this much further in a subsequent codex. For now, we will proceed with ICA as a black box and develop the analog of the PCA factorization to see what we can do with ICA that we could not with PCA.

ICA factorization

Our goal is to decompose $Y$ as

$$ Y = W_{\sf ica} X_{\sf ica},$$

where the rows of $X_{\sf ica}$ are independent (or at least "as independent as possible"). This is in contrast to PCA, where the analogous $X_{\sf pca}$ has rows that are uncorrelated rather than independent.

We will choose a normalization so that $X_{\sf ica}$ is "white" -- in other words $X_{\sf ica} X_{\sf ica}^H = I$.

Step 1:

To that end, we first compute the (rank one) matrix whose columns all equal the column-wise mean (the average of the columns) of $Y$,

$$\overline{Y} = \mu_Y 1^T,$$

and

$$\mu_Y = \dfrac{1}{n} \sum_{i} Y[:,i],$$

and then center the data: $\widetilde{Y} = Y - \overline{Y}$.

Step 2:

We now express $\widetilde{Y}$ via the SVD as

$$\widetilde{Y} = U \Sigma V^H,$$

and apply ICA on $V^H$ to obtain the orthogonal (or unitary) matrix $Q_{\sf ica}$ that decomposes $V^H$ as

$$V^H = Q_{\sf ica} V_{\sf ica}^H.$$

The $V_{\sf ica}^H$ thus yielded has rows that are as independent as possible. (We will delve into greater detail of ICA in a subsequent codex).

Step 3:

With the $Q_{\sf ica}$ returned by ICA we can compute $V_{\sf ica}$ and decompose $Y$ as

$$Y = W_{\sf ica} X_{\sf ica},$$

as below.

We have derived the ICA factorization (or decomposition) of a data matrix $Y$.

Exercise: Complete the code below

In [35]:
function ica_factorization(Y::AbstractMatrix)
    
    μy = mean(Y; dims = 2)

    Ymean = μy * ones(1, size(Y, 2))

    Ytil = Y - Ymean

    U, s, V = svd(Ytil)
    
    S = Diagonal(s)
 
    Qica = manoptICs(sqrt(size(V, 1)) * V', size(Y, 1))   ## ICA on the whitened coordinates √n Vᵀ
    Vica = V * Qica      ## V^H = Qica Vica^H  ⇒  Vica = V Qica
    Wica = U * S * Qica  ## so that Wica * Xica reproduces U Σ V^H
    Xica = inv(Wica) * Ymean + Vica'   ## independent coordinates plus the mean contribution
    return Wica, Xica
end
Out[35]:
ica_factorization (generic function with 1 method)

We now perform an ICA factorization of a matrix $Y$ and check that $Y = WX$.

In [36]:
Wica, Xica = ica_factorization(Y)
factorization_error = norm(Y - Wica * Xica)
println("Matrix factorization error is $(factorization_error)")
Matrix factorization error is 5.141800837530284e-14

Because the factorization error is small it means that we have exactly (to numerical precision) factorized the matrix. If the factorization error is not small, it is not a factorization.

Note that the ICA factorization is not unique because there are many ways of generating an $X_{\sf ica}$ that has independent components. For example, let $D$ be an arbitrary (invertible) diagonal matrix. Then, note that we have that

$$Y = (W_{\sf ica} D) (D^{-1} X_{\sf ica}),$$

and the matrix $D^{-1} X_{\sf ica}$ will also have independent components, except that each row $i$ is rescaled by $1/d_i$.

The ICA factorization that we have derived above enforces the constraint that $X_{\sf ica} X_{\sf ica}^H = I$ -- here, too, the same non-uniqueness property holds if we pick an arbitrary $D = \textrm{diag}(\pm 1, \ldots, \pm 1)$.
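A quick check of this rescaling invariance (a sketch, assuming ica_factorization from above is in scope):

using LinearAlgebra

Y = randn(2, 500)
Wica, Xica = ica_factorization(Y)

D = Diagonal([2.0, -3.0])           ## an arbitrary invertible diagonal matrix
norm(Y - (Wica * D) * (D \ Xica))   ## ≈ 0: (Wica D)(D⁻¹ Xica) is also a factorization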

We now use ICA to factorize the breast cancer dataset, from earlier.

In [37]:
Wica, Xica = ica_factorization(A_breastcancer)
benign = Xdata[:,end] .== 2
scatter(Xica[1,benign],Xica[2,benign], 
    color=:red, marker=:square, label = "benign", legend=:bottomright,
    xlabel = "Indepenent Coordinate 1", ylabel = "Independent Coordinate 2")
malignant = Xdata[:,end] .== 4
p_ica = scatter!(Xica[1,malignant],Xica[2,malignant], color=:blue, label = "malignant")
plot!(; title = "ICA")
Out[37]:
[Scatter plot "ICA": Independent Coordinate 1 vs. Independent Coordinate 2, benign (red) vs. malignant (blue)]

We now compare PCA to ICA.

In [38]:
plot(p_pca, p_ica, layout = (1,2), size = (800, 400))
Out[38]:
[Side-by-side scatter plots: "PCA" and "ICA" coordinates, benign vs. malignant]

Exercise:

-- What is the difference between the Principal Coordinates and the Independent Coordinates?

-- Which one provides better discrimination between the benign and malignant samples?

Principal Coordinates are the ones that have the maximum variance, whereas the Independent Coordinates are the ones that are (as) independent (as possible).

ICA provides better discrimination between the benign and malignant samples.

PCA versus ICA in other applications

We now illustrate additional applications of ICA.

Unmixing images

In [39]:
using Images
image1 = "images/skyline1.jpeg"
load(image1)
Out[39]:
In [40]:
image1 = "images/skyline1.jpeg"
image2 = "images/skyline2.jpeg"


I1 = Float64.(Gray.(load(image1)))
sizeI1 = size(I1)
I2 = Float64.(Gray.(load(image2)))

p1 = imshow(I1, color=:grays, aspect_ratio=:equal, title="I1")
p2 = imshow(I2, color=:grays, aspect_ratio=:equal, title="I2")
plot(p1, p2, size=(900, 250))
Out[40]:
[Grayscale images: "I1" and "I2"]

We now convert the images into vectors $s_1$ and $s_2$ and then form the matrix

$$ S = \begin{bmatrix} s_1^T \\ s_2^T \end{bmatrix},$$

and, for an arbitrary mixing matrix $A$, produce from $S$ the mixed variables

$$Y = A S,$$

as in the next code cell.

In [41]:
# run this cell only once
s1 = vec(I1) 
s2 = vec(I2);

Julia does not have an analog of MATLAB's clear function; once a name is defined in a Julia session (technically, in module Main), it is always present. The images we are loading here are rather large, and thus the memory consumed by the variable I1 is large. We can free the memory used by I1 by reassigning the variable as in the next cell.

In [42]:
I1 = 0.0
I2 = 0.0 ## this clears I1 and I2 from memory
Out[42]:
0.0

We now mix the images together and display the mixed images -- the original scenes should be barely perceptible!

In [43]:
A = [0.5 0.5; 0.5 -0.5];
S = [s1 s2]';
Y = A * S;
S = 0.0; # clear S variable
mixed1 = reshape(Y[1, :] / maximum(Y[1, :]), sizeI1)
mixed2 = reshape(Y[2, :] / maximum(Y[2, :]), sizeI1)

p3 = imshow(
    mixed1, 
    color=:grays, 
    aspect_ratio=:equal, 
    title="mixed image 1"
)
p4 = imshow(
    mixed2, 
    color=:grays, 
    aspect_ratio=:equal, 
    title="mixed image 2 "
)
plot(p3, p4; size=(900, 250))
Out[43]:
[Grayscale images: "mixed image 1" and "mixed image 2"]

We then apply the ICA factorization to the mixed variables -- this yields $W_{\sf ica}$ and $S_{\sf ica}$. We display the rows of $S_{\sf ica}$ as images. If the images thus displayed match the original images then we have succeeded in unmixing the variables.

In [44]:
mixed1 = 0.0
mixed2 = 0.0 # clear mixed1 and mixed2 variables

Wica, Sica = ica_factorization(Y)
@show pinv(Wica) * A
Sica = abs.(Sica) ## make sign positive since it's an image

# Y = 0.0; # clear Y variable

unmixed1 = reshape(Sica[1,:] / maximum(Sica[1, :]), sizeI1)
unmixed2 = reshape(Sica[2,:] / maximum(Sica[2, :]), sizeI1)
Sica = 0.0; # clear Sica variable

p5 = imshow(unmixed1; color=:grays, aspect_ratio=:equal, title="Unmixed Image 1")
p6 = imshow(unmixed2; color=:grays, aspect_ratio=:equal, title="Unmixed Image 2")
unmixed1 = 0.0
unmixed2 = 0.0

plot(p3, p4, p5, p6; layout=4, size=(900, 600))
pinv(Wica) * A = [0.000495446 0.0186838; 0.0210159 -0.00316668]
Out[44]:
[Grayscale images: mixed images 1 and 2, ICA unmixed images 1 and 2]

We now examine one of the unmixed images in greater detail and compare it to the original image.

In [45]:
plot(p1, p6, size = (900,300)) ## compare the original I1 with its unmixed counterpart
Out[45]:
[Grayscale images: "I1" and "Unmixed Image 2"]

Exercise:

Comment on how well ICA unmixes the images.

Are there regions within the image where the unmixed image is imperfect?

ICA unmixes the images very well, although the final intensities differ slightly between the original and the unmixed images. Some of the shadows from the buildings in the other image are still visible.

In [46]:
##TODO: Unmix using PCA  -- show both mixed images versus both unmixed images
mixed1 = 0.0
mixed2 = 0.0 # clear mixed1 and mixed2 variables

Wpca, Spca = pca_factorization(Y)
@show pinv(Wpca) * A
Spca = abs.(Spca) ## make sign positive since it's an image

# Y = 0.0; # clear Y variable

unmixed1 = reshape(Spca[1,:] / maximum(Spca[1, :]), sizeI1)
unmixed2 = reshape(Spca[2,:] / maximum(Spca[2, :]), sizeI1)
Spca = 0.0; # clear Spca variable

p5 = imshow(unmixed1; color=:grays, aspect_ratio=:equal, title="Unmixed Image 1")
p6 = imshow(unmixed2; color=:grays, aspect_ratio=:equal, title="Unmixed Image 2")
unmixed1 = 0.0
unmixed2 = 0.0

plot(p3, p4, p5, p6; layout=4, size=(900, 600))
pinv(Wpca) * A = [-0.00826579 -0.016171; 0.0193285 -0.00987973]
Out[46]:
[Grayscale images: mixed images 1 and 2, PCA unmixed images 1 and 2]
In [47]:
##TODO: Unmix using PCA  -- show both one of the umixed images and compare to original
plot(p1, p6, size = (900,300))
Out[47]:
[Grayscale images: "I1" and "Unmixed Image 2" (PCA)]

Exercise:

Comment on how well PCA unmixes the images.

Are there regions within the image where the unmixed image is imperfect?

PCA doesn't unmix the images very well. There is still a lot of residual from the other image.

Application: Unmixing audio signals

We will now use ICA to unmix audio signals. Such a situation arises when, for example, we have two microphones recording the conversations of two speakers. Note that we need as many mixed recordings as there are sources to unmix the original recordings. Before we begin, navigate to the directory with the audio recordings and play them so you know what the original recordings sound like.

In [48]:
using WAV
In [49]:
# create mixed sources & save as .wav file
source_num = [2 5]; # pick two -- from 1, 2, 3, 4 ,5
# try different sources here!
source1 = "audio/source$(source_num[1]).wav"
source2 = "audio/source$(source_num[2]).wav"
Out[49]:
"audio/source5.wav"
In [50]:
s1, fs = wavread(source1)   ## WAV.jl's reader; extra return values are ignored
s2, fs2 = wavread(source2)
min_s1s2_size = min(length(s1), length(s2))
# strip to same length
s1 = s1[1:min_s1s2_size]
s2 = s2[1:min_s1s2_size]
S = [s1 s2]'
Out[50]:
2×50000 Adjoint{Float64,Array{Float64,2}}:
  0.654902   0.513725    0.254902  …  0.0196078  0.027451  0.0823529
 -0.027451  -0.0666667  -0.027451     0.239216   0.388235  0.411765 

We now mix the recordings together and save the mixed recordings as .wav files. Navigate to the directory where these recordings are saved and listen to them so you can hear that we have indeed mixed them!

In [51]:
A = [2 1.5; 1.5 2]; # mixing matrix
# Try different A matrices!

Y = A * S # mix the signals together
wavwrite(Y[1, :], "mixed1.wav", Fs=fs)
wavwrite(Y[2, :], "mixed2.wav", Fs=fs)
## open and play the .wav file in the directory

Audio of first mixed waveform

We now apply ICA to the mixed recordings and save the results as .wav files -- play these files back to hear how well we have succeeded (or not). Then try it for different $A$ matrices.

In [52]:
Sica = ica_factorization(Y)[2]

invSica = (1.0 ./ maximum(abs.(Sica); dims=2))[:, 1] ## normalize to same peak amplitude
Sica = Diagonal(invSica) * Sica

wavwrite(Sica[1, :], "ica_unmixed1.wav", Fs=fs)
wavwrite(Sica[2, :], "ica_unmixed2.wav", Fs=fs)

s1, s2 = 0, 0
Out[52]:
(0, 0)
In [53]:
plot_ica = plot(plot(Sica[1,:], color =:blue, label = "", xlabel = "t", ylabel = "s1ica(t)"), 
                plot(Sica[2,:], color =:blue, label = "", xlabel = "t", ylabel = "s2ica(t)"), 
                layout = (2,1), title = "ICA")
        
plot_true = plot(plot(S[1 ,:], color=:black, xlabel = "t", ylabel = "s1(t)", title = "Original signals"), 
                 plot(S[2,:], color=:black, xlabel = "t", ylabel = "s2(t)") , 
                 layout = (2,1))

plot(plot_true,plot_ica, layout = (1,2), size = (900, 400))
Out[53]:
[Plots: "Original signals" s1(t), s2(t) alongside "ICA" s1ica(t), s2ica(t)]

We now listen to the audio to confirm that they were indeed recovered.

Audio of the first ICA unmixed waveform

Audio of the second ICA unmixed waveform

In [54]:
##TODO: Unmix using PCA -- your code here

Spca = pca_factorization(Y)[2]

invSpca = (1.0 ./ maximum(abs.(Spca), dims = 2))[:, 1] ## normalize to same peak amplitude
Spca = Diagonal(invSpca) * Spca 
wavwrite(Spca[1, :], "pca_unmixed1.wav", Fs=fs)
wavwrite(Spca[2, :], "pca_unmixed2.wav", Fs=fs)
Y  = 0
Out[54]:
0
In [55]:
plot_pca = plot(plot(Spca[1,:], color =:green, label = "", xlabel = "t", ylabel = "s1pca(t)"), 
                plot(Spca[2,:], color =:green, label = "", xlabel = "t", ylabel = "s2pca(t)"), 
                layout = (2,1), title = "PCA")
        
plot_true = plot(plot(S[1 ,:], color=:black, xlabel = "t", ylabel = "s1(t)", title = "Original signals"), 
                 plot(S[2,:], color=:black, xlabel = "t", ylabel = "s2(t)") , 
                 layout = (2,1))

plot(plot_true,plot_pca, layout = (1,2), size = (900, 400))
Out[55]:
[Plots: "Original signals" s1(t), s2(t) alongside "PCA" s1pca(t), s2pca(t)]
In [56]:
plot(plot_true,plot_ica, plot_pca, layout = (1,3), size = (900, 400))
Out[56]:
[Plots: original signals alongside ICA and PCA unmixed waveforms]

Exercise:

Comment on how the plots of the waveforms recovered by PCA and ICA reveal which one succeeds in unmixing the mixed audio sources.

Neither PCA nor ICA produces waveforms that exactly match the waveforms of the original audio signals that were mixed together.

We now compare the unmixed audio -- this is the ultimate test!

Audio of the first PCA unmixed waveform

Audio of the second PCA unmixed waveform

Exercise: How does the PCA unmixed audio sound?

The two audio signals still sound quite mixed up. Hence PCA does not appear to be unmixing the signals very well.

Application: Unmixing waveforms

We now mix waveforms together and apply ICA to the mixed waveforms.

In [57]:
sawtooth(t, width=0.5) = mod(t, width)
square(t, width=0.5) = sign(sin(t / width))
Out[57]:
square (generic function with 2 methods)
In [58]:
t = -5:0.01:5
A = [1 0.5; 1 1]

@manipulate for s1type in ["sin", "square", "sawtooth"], 
    s2type in ["cos", "square", "sawtooth"]
    
    if s1type == "sin"
        s1 = sin.(2 * t)
    elseif s1type == "square"
        s1 = square.(t, 1.25)
    elseif s1type == "sawtooth"
        s1 = sawtooth.(t, 2.3)
    end
    
    if s2type == "cos"
        s2 = cos.(3 * t)
    elseif s2type == "square"
        s2 = square.(t)
    elseif s2type == "sawtooth"
        s2 = sawtooth.(t)
    end
    
    S = [s1 s2]'
    Y = A * S
    Sica =  ica_factorization(Y)[2]
    
    p1 = plot(t,S[1,:]; color=:red, title="Signal 1")
    p2 = plot(t,S[2,:]; color=:blue, title="Signal 2")

    p3 = plot(t,Y[1,:]; color=:red, title="Mixed Signal 1")
    p4 = plot(t,Y[2,:]; color=:blue, title="Mixed Signal 2")

    
    p5 = plot(t,Sica[1,:]; color=:red, title="ICA Unmixed Signal 1")
    p6 = plot(t,Sica[2,:]; color=:blue, title="ICA Unmixed Signal 2")

    plot(p1, p2, p3, p4, p5, p6; layout=(3, 2))
end
Out[58]:

Note that the unmixed signal does not have to have the same amplitude as the original signal for us to declare it as unmixed.

This is because of the normalization chosen in the ICA decomposition (of making the covariance matrix of the unmixed signal equal to the identity matrix). Consequently, the unmixed signal can also have flipped sign relative to the original signal.

We now compare ICA to PCA.

In [59]:
## TODO code here to unmix with PCA
t = -5:0.01:5
A = [1 0.5; 1 1]

@manipulate for s1type in ["sin","square","sawtooth"], s2type in ["cos","square","sawtooth"]
    if s1type == "sin"
        s1 = sin.(2 * t)
    elseif s1type == "square"
        s1 = square.(t, 1.25)
    elseif s1type == "sawtooth"
        s1 = sawtooth.(t, 2.3)
    end
    
    if s2type == "cos"
        s2 = cos.(3 * t)
    elseif s2type == "square"
        s2 = square.(t)
    elseif s2type == "sawtooth"
        s2 = sawtooth.(t)
    end
    
    S = [s1 s2]'
    Y = A * S;
    Spca =  pca_factorization(Y)[2]
    
    p1 = plot(t,S[1,:]; color=:red, title="Signal 1")
    p2 = plot(t,S[2,:]; color=:blue, title="Signal 2")

    p3 = plot(t,Y[1,:]; color=:red, title="Mixed Signal 1")
    p4 = plot(t,Y[2,:]; color=:blue, title="Mixed Signal 2")

    
    p5 = plot(t,Spca[1,:]; color=:red, title="PCA Unmixed Signal 1")
    p6 = plot(t,Spca[2,:]; color=:blue, title="PCA Unmixed Signal 2")

    plot(p1, p2, p3, p4, p5, p6; layout=(3, 2))
end
Out[59]:

Exercise:

Comment on the unmixing produced by PCA. Are there any pairs of signals that it does nearly unmix?

No, PCA doesn't unmix any of the signals well enough.

The reason PCA fails is that, fundamentally, the signals returned by PCA correspond to the right singular vectors of the SVD of the mean-centered matrix. Thus the PCA signals are intrinsically orthogonal to each other, whereas the signals we mixed together are not. So PCA returns linear combinations of the mixed signals in such a way that they are orthogonal -- this fundamentally makes PCA unable to unmix such signals unless the signals were orthogonal to begin with and the mixing process produced orthogonal signals.

ICA, on the other hand does not enforce orthogonality and so it has the additional freedom to unmix such signals that PCA does not. We will delve into this in greater detail in a subsequent codex.

Summary

There are many ways to factorize a data matrix. PCA and ICA are learned factorizations corresponding to specific loss functions and specific orthogonality constraints. We saw that PCA yields (an approximation of) the eigenvectors of the covariance matrix whereas ICA yields (as) independent (as possible) components.

Different projections reveal different information about the data -- as in this artwork by Sue Webster and Tim Noble

ICA and PCA yield different projections of the data, and hence sometimes one is better than the other depending on what information is hidden in the data.

Finding patterns in data, even when we are seemingly given garbage in, is all about perspective and learning how to view things in just the right way.

Additional Exercises

Computational Reasoning about Statistical Independence

The notion of statistical independence of a pair of deterministic images is not obvious. Statistical independence of two random variables is easier to visualize. We begin with a formal definition of independence.

Mathematically, $x$ and $y$ are statistically independent if the joint distribution $f(x,y)$ factorizes so that $f(x,y) = f(x) f(y)$. Independence implies uncorrelatedness but not vice versa -- uncorrelated variables are guaranteed to be independent only in special cases, such as when they are jointly Gaussian.

Let us first visualize the scatter plot of two independent random variables.

In [60]:
scatter(rand(1000), randn(1000); alpha=0.3, xlabel="x", ylabel="y")
Out[60]:
[Scatter plot: independent x (uniform) vs. y (Gaussian)]

An equivalent characterization of independence is that

$$f(y|x=x_0) = f(y).$$

In other words, the conditional distribution of $y$ is the same for every value of $x = x_0$. Thus, if we "scan" over the x axis in the above plot, then $x$ and $y$ are (close to being) independent if the distribution of the y values along any vertical "slice" at $x = x_0$ is about the same.
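A minimal numeric version of this "slice" check (illustrative): for independent $x$ and $y$, summary statistics of $y$ should look the same in every slice.

using Statistics: mean, std

x, y = rand(100_000), randn(100_000)   ## independent by construction

for (a, b) in [(0.1, 0.2), (0.7, 0.8)]
    ys = y[(x .> a) .& (x .< b)]       ## y values in the vertical slice a < x < b
    println("x ∈ ($a, $b): mean(y) ≈ $(round(mean(ys), digits=3)), std(y) ≈ $(round(std(ys), digits=3))")
end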

We can now compare it to the scatter plot of two dependent random variables.

In [61]:
x = randn(1000)
y = randn(1000)
scatter(x, x + y; alpha=0.3, xlabel="x", ylabel="y")
Out[61]:
[Scatter plot: dependent variables x vs. x + y]


Armed with this insight into gauging the independence of two variables, we can see if the original images are close to being independent using the command in the next cell.

In [62]:
using Random: randperm
subset = randperm(length(s1))[1:1000]
scatter(
    s1[subset], s2[subset];
    alpha=0.3,
    xlabel="Image 1 pixel values",
    ylabel="Image 2 pixel values"
)
BoundsError: attempt to access 1-element Array{Int64,1} at index [1:1000]

Stacktrace:
 [1] throw_boundserror(::Array{Int64,1}, ::Tuple{UnitRange{Int64}}) at ./abstractarray.jl:484
 [2] checkbounds at ./abstractarray.jl:449 [inlined]
 [3] getindex(::Array{Int64,1}, ::UnitRange{Int64}) at ./array.jl:735
 [4] top-level scope at In[62]:2

(This cell errors because s1 and s2 were reassigned in the audio section above; re-run the image-loading and mixing cells to restore them.)

Gaussian signals have a kurtosis of zero. Thus ICA cannot unmix mixtures of Gaussians. It can, however, unmix a mixture of more than two independent signals so long as at most one signal has a kurtosis of zero.

There are other variants of ICA that utilize higher order cumulants and other loss functions such as entropy. Each loss function comes with its set of unmixability (or identifiability) conditions.

We will investigate these further in a subsequent codex.

Visualizing approximate signal independence when ICA works

When ICA works, as in it successfully unmixes the signals, it is because the signals are approximately independent in a statistical sense. In the next cell, consider examples from above where ICA does work and show how you can visualize their relative approximate independence.

Image unmixing

In [63]:
##TODO: Code for displaying approximate independence of images where ICA successfully unmixes images

Signal unmixing

In [64]:
##TODO: Code for displaying approximate independence of signals where ICA succesfully unmixes signals

## Load and mix the signals here 
S = ?? 
subset = randperm(length(S[1, :]))[1:1000]  ## select a random subset of the pixels to make plot less cluttered 
scatter(S[1,subset],S[2,subset];
    label="", 
    color=:green, 
    alpha = 0.1, 
    xlabel ="signal 1 values", 
    ylabel = "signal 2 values")
syntax: invalid identifier name "?"

Audio unmixing

In [65]:
##TODO: Code displaying approximate independence of audio where ICA succesfully unmixes audio signals
# create mixed sources & save as .wav file

Visualizing approximate signal dependence when ICA does not work

As a counterpoint to our discussion above, when ICA does not succeed in unmixing the signals it is because the signals are statistically dependent. In the next cell, we consider an example from the signal unmixing where ICA does not work to illustrate how we can visually spot relative approximate dependence.

In [66]:
##TODO: Generate two signal waveforms (they cannot be the same waveform :-)  -- mix them and show that ICA fails to recover them.
##      Use the same visualization as before to illustrate visually the dependence structure

Challenge problem: Unmixing mixed images that appear unmixable

In many real-world applications signals are not completely independent or dependent. In the next cell, we consider an example where ICA does not seem to work.

In [67]:
image1 = "images/face1.jpg"; image2 = "images/face2.jpg"; ## what are the original faces?

# Load images
I1 = Float64.(Gray.(load(image1)))
I2 = Float64.(Gray.(load(image2)))

sizeI1 = size(I1)
S = [vec(I1) vec(I2)]';

I1, I2 = 0, 0
#gc()
# Mix images
A = [0.5 0.5; 0.5 -0.5];
Y = A*S

#gc()

mixed1 = reshape(Y[1, :]/maximum(Y[1, :]), sizeI1)
mixed2 = reshape(Y[2, :]/maximum(Y[2, :]), sizeI1)

## TODO: attempt to unmix
W,Sica = ica_factorization(Y)
unmixed1 = reshape(Sica[1,:],sizeI1)
unmixed2 = reshape(Sica[2,:],sizeI1)
Sica , Y = 0, 0 
#gc()


p1 = imshow(mixed1,color=:grays, aspect_ratio=:equal, title = "Mixed Image 1 ")
p2 = imshow(mixed2,color=:grays,  aspect_ratio=:equal, title = "Mixed Image 2")
p3 = imshow(unmixed1,color=:grays,  aspect_ratio=:equal, title = "Unmixed Image 1")
p4 = imshow(unmixed2,color=:grays,  aspect_ratio=:equal, title = "Unmixed Image 2")

plot(p1,p2,p3,p4,layout=4)
Out[67]:
[Grayscale images: Mixed Images 1 and 2, Unmixed Images 1 and 2]

In the next plot we show a scatter plot of the pixel values of the two images, revealing the strong correlation structure.

Exercise: What is the origin of the strong correlation? Hint: What does the alignment of the faces have to do with it?

In [68]:
subset = randperm(length(S[1,:]))[1:1000]
scatter(
    S[1,subset], S[2,subset], 
    alpha = 0.15, 
    xlabel="Image 1 pixel values", 
    ylabel="Image 2 pixel values"
)
Out[68]:
[Scatter plot: Image 1 pixel values vs. Image 2 pixel values]

As a whole, the pixel values in the images are dependent because of the strong alignment of the faces. However, buried in this scatter plot is the valuable information that there are ranges of pixel values in each image -- corresponding to a square (or rectangular) sub-region of the scatter plot -- within which the pixel values are almost independent.

Exercise:

Can you utilize this insight on the existence of an independent sub-region within the images to successfully unmix them?

Hint: You will determine the unmixing matrix on just this (nearly) independent sub-region and then apply it to the whole image (which consists of mostly dependent parts).

In [69]:
## Hard Problem: Code to unmix the images above --
## Hint: can we relax the notion of independent "everywhere" to independent "somewhere" and succeed?
## Hint: You will have to find the sub-region that is nearly independent, determine the unmixing matrix there
##       and apply it to the rest of the image.
image1 = "images/face1.jpg"; image2 = "images/face2.jpg"; ## what are the original faces?

# Load images
I1 = Float64.(Gray.(load(image1)))
I2 = Float64.(Gray.(load(image2)))

sizeI1 = size(I1)
S = [vec(I1) vec(I2)]';

I1, I2 = 0, 0
#gc()
# Mix images
A = [0.5 0.5; 0.5 -0.5];
Y = A*S

#gc()

mixed1 = reshape(Y[1, :]/maximum(Y[1, :]), sizeI1)
mixed2 = reshape(Y[2, :]/maximum(Y[2, :]), sizeI1)

## attempt to unmix: learn the unmixing matrix on a (nearly) independent sub-region only
W, Sica = ica_factorization(Y[:, 700:end])   ## trailing columns serve as the sub-region
X = inv(W) * Y                               ## apply the learned unmixing to the whole image
unmixed1 = reshape(X[1, :], sizeI1)
unmixed2 = reshape(X[2, :], sizeI1)
Sica , Y = 0, 0 
#gc()


p1 = imshow(mixed1,color=:grays, aspect_ratio=:equal, title = "Mixed Image 1 ")
p2 = imshow(mixed2,color=:grays,  aspect_ratio=:equal, title = "Mixed Image 2")
p3 = imshow(unmixed1,color=:grays,  aspect_ratio=:equal, title = "Unmixed Image 1")
p4 = imshow(unmixed2,color=:grays,  aspect_ratio=:equal, title = "Unmixed Image 2")

plot(p1,p2,p3,p4,layout=4)
Out[69]:
[Grayscale images: Mixed Images 1 and 2, Unmixed Images 1 and 2 (sub-region unmixing)]

Another challenge problem

In [70]:
image1 = "challenge_mixed1.png"
image2 = "challenge_mixed2.png"
I1 = Float64.(Gray.(load(image1)))
I2 = Float64.(Gray.(load(image2)))
plot(imshow(I1;  color=:grays, aspect_ratio=:equal, title="Mixed Image 1 "), 
    imshow(I2;  color=:grays, aspect_ratio=:equal, title="Mixed Image 2"))
Out[70]:
[Grayscale images: "Mixed Image 1" and "Mixed Image 2"]
In [71]:
##TODO: Unmix the images

Bonus question:

Who are these individuals?